METHOD AND CIRCUIT FOR ALIGNMENT OF FLOATING POINT
SIGNIFICANDS IN A SIMD ARRAY MPP

FIELD OF THE INVENTION

[0001] The present invention relates to the field of massively parallel processing
systems, and more particularly to a method and apparatus for efficiently normalizing and
aligning the significand portion of a floating point number in a single instruction multiple
data (SIMD) massively parallel processing (MPP) system.



BACKGROUND OF THE INVENTION 

[0002] This application is related to application serial number 09/__,
filed on __, entitled "Method and Circuit for Normalization of Floating Point
Significands in a SIMD Array MPP," the disclosure of which is incorporated by
reference.

[0003] The fundamental architecture used by all personal computers (PCs) and
workstations is generally known as the von Neumann architecture, illustrated in block
diagram form in Fig. 1. In the von Neumann architecture, a main central processing
unit (CPU) 10 is coupled via a system bus 11 to a memory 12. The memory 12,
referred to herein as "main memory", also contains the data on which the CPU 10
operates. In modern computer systems, a hierarchy of cache memories is usually built
into the system to reduce the amount of traffic between the CPU 10 and the main
memory 12.

[0004] The von Neumann approach is adequate for low to medium performance
applications, particularly when some system functions can be accelerated by special
purpose hardware (e.g., 3D graphics accelerator, digital signal processor (DSP), video
encoder or decoder, audio or music processor, etc.). However, the performance of
accelerator hardware is limited by the bandwidth of the link from the CPU/memory
part of the system to the accelerator. The approach may be further limited if the
bandwidth is shared by more than one accelerator. Thus, the processing demands of
large data sets are not served well by the von Neumann architecture. Similarly, as the
processing becomes more complex and the data larger, the processing demands may not
be met even with the conventional accelerator approach.

[0005] Referring now to Fig. 2, an alternative to the von Neumann architecture is
the single instruction multiple data (SIMD) massively parallel processor (MPP) system.
An MPP system differs from a von Neumann system by using a large number of
processors, called processing elements (PEs) 200, coupled to a communications network
15. The communications network 15 permits each PE 200 to exchange data with other
PEs 200. Additionally, the PEs 200 may read or write to main memory 12 via an
array-to-memory bus 13, or receive commands or instructions from CPU 10 via bus 11.
Although the CPU 10 may perform some processing, in a SIMD MPP system, the array
of PEs 14, comprising the PEs 200 and their communications network 15, performs most
of the computations. The CPU 10 functions in a supporting role.

[0006] In a SIMD MPP, each PE operates on the same instruction, at the same
time, but on different pieces of data. Since the PEs in a SIMD array operate in lockstep,
data dependent (conditional) operations cannot be performed in the same way as is
done in a conventional processor. Instead, each PE can decide whether to store the
result of an operation either in an internal register or in a memory dependent upon a
condition generated within the PE from data local to the PE. This technique is known
as "activity control" and is a very powerful method for performing data dependent
decisions in a parallel computer which operates on a single stream of instructions.
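
The idea of activity control can be illustrated with a minimal software sketch. The function name simd_step and the list-of-values model of the PE array are assumptions made only for illustration; they are not part of the disclosed hardware.

```python
def simd_step(pe_values, condition, operation):
    """Apply one instruction to every PE; each PE stores the result only if
    its own locally generated condition is true (activity control)."""
    results = []
    for value in pe_values:               # every PE receives the same instruction
        candidate = operation(value)      # every PE performs the operation
        # the local condition decides whether the result is kept or discarded
        results.append(candidate if condition(value) else value)
    return results


pe_data = [3, 8, 12, 5]
# "halve the value if it is even", without branching the instruction stream
halved = simd_step(pe_data, lambda v: v % 2 == 0, lambda v: v // 2)
print(halved)  # [3, 4, 6, 5]
```

Every modeled PE executes the same operation; only the decision to keep the result differs from PE to PE.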

[0007] Most SIMD MPPs utilize relatively simple processors for PEs 200. For
example, short integer PEs 200, such as 8-bit integer processors, may be used. SIMD
MPPs utilize these simple processors in order to increase the number of PEs 200 which
can be integrated upon a single silicon die. High performance is achieved by the use of
a large number of simple PEs 200, each operating at a high clock speed.

[0008] The use of short integer PEs 200 means that floating point operations may
require several clock cycles to complete. In many computer systems, floating point
numbers are often stored in a manner consistent with the IEEE-754 standard. In
particular, the IEEE-754 standard stores a single precision floating point number as three
binary fields taking the format of:

$$(-1)^{s} \times 2^{(e-127)} \times (1.f) \qquad (1)$$

wherein:

s is a single bit representing the sign of the floating point number.

e is an 8-bit unsigned integer representing a biased exponent. e is
said to represent a biased exponent because the actual exponent being
represented is equal to e - 127. Although an 8-bit unsigned integer may
range from 0-255, thereby permitting exponents in the range from -127
(i.e., -127 = 0 - 127) to +128 (i.e., 128 = 255 - 127), the IEEE-754
standard limits the range of usable exponents to exclude -127 and +128.

1.f is a 24-bit significand field in a "normalized" format, i.e., a bit
field in which the most significant bit (MSB) is the first digit left of the
binary point and in which the most significant bit is set to one. Since the
most significant bit of a normalized number is understood to be 1, there is
no need to store the most significant bit.

[0009] Data which have biased exponents of 0 and 255 are used to represent
special conditions and the number zero. The IEEE-754 standard represents the
number zero using a biased exponent of 0 (i.e., for the single precision format, the
exponent equals -127) and a significand field of 000000000000000000000000 (binary).
(In the special cases of zero and non-normalized numbers, indicated by the exponent
being 0, the most significant bit of the significand is not taken to be a 1.)
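
As an illustration of the single precision layout described above, the following sketch splits a value into its sign, biased exponent, and 24-bit significand using Python's standard struct module. The function name demerge is chosen here for illustration only, and the handling of the biased exponent 255 (special conditions) is deliberately left out.

```python
import struct


def demerge(x):
    """Split x, rounded to IEEE-754 single precision, into (s, e, significand)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    s = bits >> 31                  # 1-bit sign
    e = (bits >> 23) & 0xFF         # 8-bit biased exponent
    f = bits & 0x7FFFFF             # 23 stored fraction bits
    # The leading 1 is implied for normalized numbers and absent when e == 0.
    # Values with e == 255 (special conditions) are not treated specially here.
    significand = f if e == 0 else (1 << 23) | f
    return s, e, significand


s, e, sig = demerge(-6.5)           # -6.5 = -1.101 (binary) x 2^2
print(s, e, e - 127, hex(sig))      # 1 129 2 0xd00000
```

The 24-bit significand fits exactly into three 8-bit quantities, which is the property the register arrangement described later relies on.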



[0010] Under the IEEE-754 standard, double precision and extended precision
numbers are stored in a similar format, albeit using different sized exponents and
significands. For example, double precision numbers use an 11-bit biased
exponent field with representable exponents ranging from -1022 to 1023 and a
significand having 53 bits.



[0011] In order to perform arithmetic operations on floating point numbers stored
in the IEEE-754 format, the floating point numbers first need to be separated, or
"demerged", to extract the sign bit, the exponent, and the significand. Once these
fields have been extracted, they can be operated upon in order to perform the arithmetic
operation. For example, multiplying two floating point numbers includes multiplying
the significands and adding the exponents. For addition and subtraction, the significand
fields of both operands must be properly aligned. This may require shifting the
significand field and adjusting the exponent field of one of the operands until both
operands have the same exponent field. This process is known as alignment.
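
A small numeric sketch may make alignment concrete. It follows the common convention of right shifting the significand of the operand with the smaller exponent; that convention, the function name align, and the tuple layout are assumptions of this sketch rather than a statement about any particular circuit.

```python
def align(e_a, sig_a, e_b, sig_b):
    """Return (common_exponent, sig_a, sig_b) after right shifting the
    significand that belongs to the smaller biased exponent."""
    if e_a >= e_b:
        return e_a, sig_a, sig_b >> (e_a - e_b)
    return e_b, sig_a >> (e_b - e_a), sig_b


# 6.5 = 1.101 (binary) x 2^2 -> biased exponent 129, significand 0xD00000
# 1.0 = 1.000 (binary) x 2^0 -> biased exponent 127, significand 0x800000
exponent, a, b = align(129, 0xD00000, 127, 0x800000)
total = a + b                                      # significands can now be added
print(exponent, hex(a), hex(b), hex(total))        # 129 0xd00000 0x200000 0xf00000
print(total / 2**23 * 2.0 ** (exponent - 127))     # 7.5
```

Bits shifted out on the right are simply discarded in this sketch; rounding is outside its scope.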

[0012] In conventional computer systems, alignment is normally performed using
standard shifting logic, such as barrel shifters. Shifting logic is used in conventional
computer systems because it has adequate speed and does not consume a
significant amount of silicon real estate in comparison to the other circuitry in a complex
CPU 10. However, in a SIMD MPP, the use of standard shifting logic such
as barrel shifters would significantly increase the size of the PEs 200 and may also be too
slow. Accordingly, there is a desire and need for a way to efficiently perform alignment
of floating point significands in a SIMD MPP environment.




SUMMARY OF THE INVENTION 

[0013] The present invention is directed at a processing element of a SIMD MPP
which can efficiently perform the alignment process commonly used when performing
arithmetic operations on floating point numbers. The PEs of the SIMD MPP include
two groups of registers. One of the groups is known as the M block and includes a
plurality of registers and logic which permits limited right shifting (e.g., 1-, 2-, 4-, and
8-bit right shifts are supported) of the contents of the registers. A method is used with
the limited right shifting ability of the M block registers to align significands. The other
group of registers is known as the Q block and includes a plurality of registers and logic
which permits limited left shifting of the contents of the registers.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014] The foregoing and other advantages and features of the invention will
become more apparent from the detailed description of the preferred embodiments of
the invention given below with reference to the accompanying drawings in which:

[0015] FIG. 1 is a block diagram of a prior art von Neumann architecture
computer system;

[0016] FIG. 2 is a block diagram of a SIMD MPP computer system;

[0017] FIG. 3 is a block diagram of one of the PEs in the SIMD MPP computer
system in accordance with the principles of the present invention;

[0018] FIGS. 4A and 4B are a flow chart which illustrates how the PE of the
present invention aligns significand data; and

[0019] FIG. 5 is a flowchart which illustrates how the PE of the present invention
normalizes significand data.

DETAILED DESCRIPTION OF THE INVENTION

[0020] Now referring to the drawings, where like reference numerals designate like
elements, there is shown in Fig. 3 a block diagram of a PE 200 in accordance with the
principles of the present invention. The PE 200 is divided into several functional
blocks, including an ALU 301, which is coupled to a Node Communications Interface
305 and a DRAM Interface 303. The Node Communications Interface 305 is used by
the PE 200 to send and receive messages to the four other PEs 200 adjacent to the
present PE 200, over signal lines 306a, 306b, 306c, and 306d. The DRAM Interface
303 is used by the PE 200 to read and write data to the main memory 12. The ALU 301
is also coupled to a series of registers, including a register file 302, a
series of flag registers 307, and a shift control register ("SCR") 360. In the exemplary
embodiment, the SCR 360 is an 8-bit register with the most significant bit designated
bit 7 and the least significant bit designated bit 0. The function of the flag registers 307
and the SCR 360 will be explained later. The PE 200 also includes two register blocks,
namely the M Block 350a and the Q Block 350b.

[0021] The M block 350a includes a bus called the M Bus 307a which is coupled
to the Node Communications Interface 305. The M bus 307a is also coupled, via logic
circuit 308a, to a plurality of registers. These registers include the M3 310, M2 311,
M1 312, M0 313, and MS 314 registers. In some embodiments an optional G
register 320 may also be present. The G register 320 may be used, for example, to store
extension bits for use in higher precision calculations. In one exemplary embodiment,
registers M3 310, M2 311, M1 312, and M0 313 are 8-bit registers while register MS
314 is a single bit register. Logic circuit 308b couples registers M3 310, M2 311, M1
312, M0 313, MS 314, and G 320 to Q Bus 307b, ALU 301 and DRAM Interface
304. The logic circuits 308a and 308b represent conventional logic circuits such as a
network of multiplexers, which permit the registers M3 310, M2 311, M1 312, M0
313, MS 314, and G 320 to receive and transmit data in a manner which will be
described in additional detail.

[0022] Additionally, logic circuits 308a, 308b are also capable of demerging an
IEEE-754 formatted number into its sign, biased exponent, and significand fields. In
particular, the sign is stored in register MS 314, the biased exponent is stored in M3
310, and the significand is stored in registers M2 311 (most significant byte), M1 312,
and M0 313 (least significant byte). The logic circuits 308a, 308b may also be capable
of setting registers M2 311, M1 312, and M0 313 to zero. Finally, logic circuits 308a,
308b also permit data stored in registers M2 311 and M1 312 to be right shifted in
increments of 1, 2, 4, and 8 bits. The M registers (i.e., MS 314, M0 313, M1 312, M2
311, and M3 310) and the Q registers (i.e., QS 334, Q0 333, Q1 332, Q2 331, and
Q3 330) are coupled via signal line 307c. This permits the contents of the M registers
to be transferred in one clock cycle to corresponding Q registers in the Q block.

[0023] The Q block 350b is similar to the M block 350a. The Q block has a bus
known as the Q bus 307b. The Q bus 307b is not coupled to the Node
Communications Interface 305. Instead, the Q bus 307b is coupled via signal line 307c
to the M Bus 307a of the M block 350a. The Q block 350b includes a series of Q
registers, namely QS 334, Q0 333, Q1 332, Q2 331, and Q3 330. In the exemplary
embodiment, register QS is a single bit register while registers Q0 333, Q1 332, Q2
331, and Q3 330 are 8-bit registers. The Q block 350b has logic circuits 309a, 309b
which function in a manner similar to logic circuits 308a, 308b of the M block 350a.
One significant difference between the two sets of logic circuits 309a/309b and
308a/308b, however, is that while logic circuits 308a, 308b permit data stored in
registers M2 and M1 to be right shifted in 1, 2, 4, and 8 bit increments, logic circuits
309a, 309b permit data in registers Q2 331 and Q1 332 to be left shifted, in the same
increments.

[0024] The PE 200 also includes a flag register 307 which contains a plurality of
flags. These flags default to being set to zero, unless a specific condition resets them to
one. In the exemplary embodiment there are four flags named Q2Z8, Q2Z4, Q2Z2,
and Q2Z1, which function as described below. Flag Q2Z8 is one if all eight bits of
register Q2 331 are zero. Flag Q2Z4 is one if the four most significant bits of register
Q2 331 are zero. Flag Q2Z2 is one if the two most significant bits of register Q2 331
are both zero. Finally, flag Q2Z1 is one if the most significant bit of register Q2 331 is
zero.
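
The four flag definitions can be restated as a short software model. The function name q2_flags and the dictionary return value are illustrative assumptions; the hardware would derive these bits combinationally from register Q2.

```python
def q2_flags(q2):
    """Model of the four leading-zero flags derived from the 8-bit register Q2."""
    q2 &= 0xFF                          # Q2 is an 8-bit register
    return {
        "Q2Z8": int(q2 == 0),           # all eight bits are zero
        "Q2Z4": int(q2 >> 4 == 0),      # four most significant bits are zero
        "Q2Z2": int(q2 >> 6 == 0),      # two most significant bits are zero
        "Q2Z1": int(q2 >> 7 == 0),      # most significant bit is zero
    }


print(q2_flags(0b00010110))
# {'Q2Z8': 0, 'Q2Z4': 0, 'Q2Z2': 1, 'Q2Z1': 1}
```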

[0025] The PE 200 performs floating point arithmetic operations by first
demerging the two IEEE-754 formatted operands. This is done by loading the first
operand into the M block 350a. The operand may be loaded from the Node
Communications Interface 305 if the operand is sent from an adjacent PE 200.
Alternatively, the operand may be loaded from the DRAM Interface 303 if the operand
had been loaded into the main memory 12. As mentioned previously, the logic circuits
308a, 308b in M block 350a demerge an IEEE-754 formatted operand into its sign,
biased exponent, and significand fields by storing the sign field in register MS 314, the
biased exponent in register M3 310, and the significand in registers M2 311, M1 312,
and M0 313. Once the first operand has been demerged, it is transferred via signal line
307c to the Q block 350b. The second operand is then loaded to the M block 350a and
demerged. At this point, the two successive demerged operands are in the M block
350a and the Q block 350b.

[0026] Depending on the type of arithmetic operation which is to be performed
(e.g., addition or subtraction may require aligning the significand and correspondingly
adjusting the exponent), further reformatting operations may need to be performed on
the operands stored in the Q block 350b and M block 350a. In particular, the PE 200
of the present invention aligns the operands in the following manner. First, the
exponent values of the two operands are compared by subtracting them and storing the
result in the shift control register (SCR) 360. More specifically, SCR <- M3 - Q3. The
result of the calculation can be interpreted in the following manner:

If the number stored in the SCR register 360 is equal to zero, then
the two exponents are identical and no alignment is required.

If the number stored in the SCR register 360 is greater than zero,
then the two operands may be aligned by shifting the contents of the M registers
310-313 to the right. The amount to be shifted is the number stored in the SCR
register 360.

If the number stored in the SCR register 360 is less than zero,
then the two operands may be aligned by shifting the contents of the Q registers
330-333 to the right. The amount to be shifted is the negative of the number
stored in the SCR register 360.

[0027] However, as previously described, only the M block is capable of right
shifting. Thus, if the SCR contains a negative value, the contents of the M block 350a
and the Q block 350b need to be swapped and the value in the SCR negated (so that it
becomes a positive number).

[0028] The exponent of the operand stored in the M block 350a is then adjusted
to its post alignment value. More specifically, the exponent, which is stored in M3,
takes the following value:

M3 = M3 - SCR (2)

The alignment of the significand is performed according to the nine steps
described below and illustrated in Figs. 4A and 4B as steps 400-419.

(Step 1) If bit 7 of the SCR 360 is a one (Fig. 4A, 401), this means
the significand stored in registers M2 311, M1 312, and M0 313 needs to be right
shifted by at least 128 bits. Since the three 8-bit registers M2 311, M1 312, and M0
313 store at most 24 bits, the shifted result will underflow if the condition is true.
Thus, registers M2 311, M1 312, and M0 313 are each set to zero (Fig. 4A, 402).

(Step 2) If bit 6 of the SCR 360 is a one (Fig. 4A, 403), this means
the significand stored in registers M2 311, M1 312, and M0 313 needs to be shifted by
at least 64 bits. As with step (1), if the condition is true an underflow will result. Thus,
registers M2 311, M1 312, and M0 313 are each set to zero (Fig. 4A, 404).

(Step 3) If bit 5 of the SCR 360 is a one (Fig. 4A, 405), this means
the significand stored in registers M2 311, M1 312, and M0 313 needs to be shifted by
at least 32 bits. As with steps (1) and (2), if the condition is true an underflow will
result. Thus, registers M2 311, M1 312, and M0 313 are each set to zero (Fig. 4A,
406).

(Step 4) If bit 4 of the SCR 360 is a one (Fig. 4A, 407), this means
a shift of at least 16 bits is required. As previously explained, the logic 308 only permits
right shifting of the M block registers in increments of up to 8 bits. Thus, a 16-bit right
shift will need to be performed as two separate 8-bit right shifts. Thus, registers M2
311, M1 312, and M0 313 are each right shifted by 8 bits (Fig. 4A, 408).

(Step 5) If bit 4 of the SCR 360 is a one (Fig. 4A, 409), this means
a shift of at least 16 bits is required. Another 8-bit right shift is performed on
registers M2 311, M1 312, and M0 313 (Fig. 4A, 410) so that steps (4) and (5)
collectively result in a 16-bit right shift.

(Step 6) If bit 3 of the SCR 360 is a one (Fig. 4A, 411), this means
a shift of at least 8 bits is required. Thus, each of registers M2 311, M1 312, and M0
313 is right shifted by 8 bits (Fig. 4A, 412).

(Step 7) If bit 2 of the SCR 360 is a one (Fig. 4B, 413), this means a
shift of at least 4 bits is required. Thus, each of registers M2 311, M1 312, and M0
313 is right shifted by 4 bits (Fig. 4B, 414).

(Step 8) If bit 1 of the SCR 360 is a one (Fig. 4B, 415), this means a
shift of at least 2 bits is required. Thus, each of registers M2 311, M1 312, and M0
313 is right shifted by 2 bits (Fig. 4B, 416).

(Step 9) If bit 0 of the SCR 360 is a one (Fig. 4B, 417), this means a
shift of at least 1 bit is required. Thus, each of registers M2 311, M1 312, and M0
313 is right shifted by 1 bit (Fig. 4B, 418).

[0029] Note that logically, once any one of the conditionals in steps (1), (2), or (3)
is met, the final result of the 9-step sequence is known when registers M2 311, M1 312,
and M0 313 are each set to zero. However, in a SIMD MPP environment, different
PEs 200 operate on different data, using the same instruction stream. Thus, each PE
should execute each of the 9 steps described above to ensure that the data being
operated on by each PE 200 is correctly aligned. The above described method
therefore permits a single stream of instructions to align IEEE-754 formatted floating
point numbers in each PE 200 in the array 14. Each PE 200 only requires shifting
logic, such as logic circuits 308a, 308b, which can perform 1, 2, 4, and 8-bit right shifts.
The logic circuits 308a, 308b required are significantly smaller and faster than a full 24-
bit barrel shifter, thereby permitting a larger number of PEs 200 to be integrated upon
a single chip. In the preferred embodiment, each of the nine steps can be performed in
a single clock cycle, thereby requiring only 9 clock cycles to align every PE 200 in the
array 14.
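
The nine-step sequence can be modeled in software as a fixed list of (SCR bit, action) pairs that is always executed in full. The following is a minimal sketch under the assumption that the significand occupies M2 (most significant byte), M1, and M0 as one 24-bit value; the function name align_significand and the table STEPS are hypothetical and are not taken from the figures.

```python
# (SCR bit tested, action) for steps 1-9; None means "set the significand to zero",
# an integer means "right shift by that many bits".
STEPS = [(7, None), (6, None), (5, None), (4, 8), (4, 8), (3, 8), (2, 4), (1, 2), (0, 1)]


def align_significand(scr, m2, m1, m0):
    """Apply all nine steps to the 24-bit significand held in M2, M1, M0."""
    sig = (m2 << 16) | (m1 << 8) | m0
    for bit, shift in STEPS:            # every step is always visited, in order
        if (scr >> bit) & 1:            # only the tested SCR bit gates the action
            sig = 0 if shift is None else sig >> shift
    return (sig >> 16) & 0xFF, (sig >> 8) & 0xFF, sig & 0xFF


m2, m1, m0 = align_significand(0b00001011, 0b10101111, 0b00000101, 0b11100011)
print(f"{m2:08b} {m1:08b} {m0:08b}")    # 00000000 00010101 11100000
```

For an 8-bit SCR value n, the net effect of this sketch equals a single right shift of the 24-bit significand by n bits, even though only 1-, 2-, 4-, and 8-bit shifts and a clear operation are ever used.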

[0030] For example, suppose the array 14 has two PEs 200, and their registers
are set as follows (all register values are specified in binary):

            First PE      Second PE
SCR         0100 0001     0000 1011
M2          1000 1000     1010 1111
M1          1100 1100     0000 0101
M0          1110 1110     1110 0011

[0031] The data in the two PEs 200 would then be aligned in the following
manner:



[0032] In step (1), for both PEs 200, bit 7 of the SCR 360 is equal to zero, so no
further processing is performed in step (1). The state of the registers after step (1) is:

            First PE      Second PE
SCR         0100 0001     0000 1011
M2          1000 1000     1010 1111
M1          1100 1100     0000 0101
M0          1110 1110     1110 0011



[0033] In step (2), for the first PE 200, bit 6 of the SCR 360 is equal to one, so
the contents of M2, M1, and M0 are each set to zero. For the second PE 200, bit 6 of
the SCR 360 is equal to zero, so no further processing is performed in step (2). The
state of the registers after step (2) is:

            First PE      Second PE
SCR         0100 0001     0000 1011
M2          0000 0000     1010 1111
M1          0000 0000     0000 0101
M0          0000 0000     1110 0011



[0034] In step (3), for both PEs 200, bit 5 of the SCR 360 is equal to zero so no
further processing is performed in step (3). The state of the registers after step (3) is:

            First PE      Second PE
SCR         0100 0001     0000 1011
M2          0000 0000     1010 1111
M1          0000 0000     0000 0101
M0          0000 0000     1110 0011



[0035] In step (4), bit 4 of the SCR 360 for both PEs 200 is equal to zero so no
further processing is performed in step (4). The state of the registers after step (4) is:

            First PE      Second PE
SCR         0100 0001     0000 1011
M2          0000 0000     1010 1111
M1          0000 0000     0000 0101
M0          0000 0000     1110 0011



[0036] In step (5), bit 4 of the SCR 360 for both PEs 200 is equal to zero so no
further processing is performed in step (5). The state of the registers after step
(5) is:

            First PE      Second PE
SCR         0100 0001     0000 1011
M2          0000 0000     1010 1111
M1          0000 0000     0000 0101
M0          0000 0000     1110 0011

[0037] In step (6), for the first PE 200, bit 3 of the SCR 360 is equal to zero so no
further processing is performed in step (6). For the second PE 200, bit 3 of the SCR
360 is equal to one, so an 8-bit right shift is performed. The state of the registers after
step (6) is:

            First PE      Second PE
SCR         0100 0001     0000 1011
M2          0000 0000     0000 0000
M1          0000 0000     1010 1111
M0          0000 0000     0000 0101


[0038] In step (7), for both PEs 200, bit 2 of the SCR 360 is equal to zero so no
further processing is performed in step (7). The state of the registers after step (7) is:

            First PE      Second PE
SCR         0100 0001     0000 1011
M2          0000 0000     0000 0000
M1          0000 0000     1010 1111
M0          0000 0000     0000 0101



[0039] In step (8), for the first PE 200, bit 1 of the SCR 360 is equal to zero so no
further processing is performed in step (8). For the second PE, bit 1 of the SCR 360 is
equal to one so a 2-bit right shift is performed. The state of the registers after step (8)
is:

            First PE      Second PE
SCR         0100 0001     0000 1011
M2          0000 0000     0000 0000
M1          0000 0000     0010 1011
M0          0000 0000     1100 0001
[0040] In step (9), for both PEs 200, bit 0 of the SCR 360 is equal to one so a 1-
bit right shift is performed in each PE. The state of the registers after this final step,
which results in alignment for both PEs 200, is:

            First PE      Second PE
SCR         0100 0001     0000 1011
M2          0000 0000     0000 0000
M1          0000 0000     0001 0101
M0          0000 0000     1110 0000
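
The two-PE walk-through can be replayed with a short lockstep model in which one instruction stream is applied to every PE and only the local SCR bit decides whether a PE acts. The names STEPS, fmt, and trace, and the dictionary representation of a PE, are assumptions made for this sketch.

```python
# Steps 1-9 as (SCR bit tested, action); "zero" clears the significand.
STEPS = [(7, "zero"), (6, "zero"), (5, "zero"), (4, 8), (4, 8), (3, 8), (2, 4), (1, 2), (0, 1)]


def fmt(byte):
    """Format one 8-bit register as two groups of four binary digits."""
    return f"{byte >> 4:04b} {byte & 0xF:04b}"


def trace(pes):
    """Run all nine steps on every PE in lockstep and record each state.
    Each PE is a dict with an 8-bit 'scr' and a 24-bit 'sig' (M2:M1:M0)."""
    history = []
    for number, (bit, action) in enumerate(STEPS, start=1):
        for pe in pes:                   # same instruction for every PE
            if (pe["scr"] >> bit) & 1:   # local data decides whether to act
                pe["sig"] = 0 if action == "zero" else pe["sig"] >> action
        history.append((number, [pe["sig"] for pe in pes]))
    return history


pes = [{"scr": 0b01000001, "sig": 0b10001000_11001100_11101110},
       {"scr": 0b00001011, "sig": 0b10101111_00000101_11100011}]
for number, sigs in trace(pes):
    rows = "   ".join(" ".join(fmt((s >> k) & 0xFF) for k in (16, 8, 0)) for s in sigs)
    print(f"after step {number}: {rows}")
```

Each printed line lists M2, M1, and M0 for the first PE followed by the second PE. In the last line, the second PE holds M2 = 0000 0000, M1 = 0001 0101, and M0 = 1110 0000, and the first PE holds all zeros.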



[0041] Once the significand has been aligned (if necessary), the ALU 301, which is
coupled to the M block 350a via logic circuit 308b and the Q block 350b via logic
circuit 309b, can perform the arithmetic operation in an ordinary manner. For example,
the significands may be added, subtracted, or multiplied. For addition and subtraction
the exponents of the operands are equal and do not require adjustment. For
multiplication, the exponents are summed. The results of the arithmetic operation are
stored in the Q block 350b. As usual, the most significant byte of the result is stored in
register Q2, and lesser significant bytes of the result are progressively stored in registers
Q1 and Q0. If there are additional bits of the result which need storing, the lesser
significant bytes of the result may be stored in the G register 320 and the M0 register
313 of the M Block 350a, and additional lesser significant bytes of the result may be
stored in the register file.

[0042] Thus, the present invention provides an apparatus and a method for
normalizing the significand portion of a floating point number, such as those which
follow the IEEE-754 floating point standard, in a SIMD MPP environment. The
present invention is advantageous in that each PE 200 of the array 14 is not required to
have a full feature shifter, such as a barrel shifter. Instead, a faster but more limited
shifting logic, such as logic circuits 308a, 308b, which are only capable of shifting the
significand data by 1, 2, 4, or 8 bits, is used in combination with a shift control
register 360, under a nine step procedure, to align the significand. Ideally, the
instruction or instructions which correspond to each of the nine steps can be executed
by a PE 200 in a single clock cycle. Since in a SIMD environment each PE 200 in the
array 14 executes the same instruction at the same time, every significand in the array 14
can be aligned in as little as nine clock cycles.

[0043] Although the invention has been discussed and illustrated in the context of
an 8-bit shift control register and shifting circuits which are capable of shifting significand
data by 1, 2, 4, and 8 bits, the invention is not so limited and may be generalized as
follows: The flexibility of the right shifting circuitry and the width of the shift control
register may be varied. The shift control register can be J+1 bits wide, wherein J is a
positive integer of at least 7, with the most significant bit being bit J and the least
significant bit being bit 0. The right shifting circuitry can be capable of right shifting
the significand by 2^0, 2^1, 2^2, ..., 2^N bits, wherein N is a range of integers between 0 and
M, wherein M is a positive integer of at least 3 and wherein 2^(M+2) is greater than the
width of the significand.

[0044] The generalized alignment process begins with storing the difference
between the exponents in the shift control register. As usual, if a negative number
would have been stored, that number is negated before storing and the contents of the
register blocks are exchanged. Each bit of the shift control register is checked (from the
most significant bit to the least significant bit). If bit I (where I is an integer ranging
from J to 0) is equal to one, the right shifting circuitry performs one of three actions
depending on the value of I. If I is greater than M+1, any attempt to right shift the
significand by 2^I bits would be a lengthy operation which results in an underflow. Thus,
in these circumstances, the right shifting circuitry sets each bit of the significand to zero.
If I is equal to M+1, the right shifting circuitry twice right shifts the significand by 2^M bits.
If I is less than or equal to M, the right shifting circuitry right shifts the significand by
2^I bits.
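
The generalized process can be sketched as a loop over the bits of the shift control register, from bit J down to bit 0. The function name align_general, its default arguments, and the explicit check that 2^(M+2) exceeds the significand width are assumptions of this sketch; the check reflects one reading of the width condition rather than a confirmed requirement.

```python
def align_general(scr, sig, J=7, M=3, width=24):
    """Right shift `sig` by the amount held in a (J+1)-bit shift control value,
    using only shifts by 2**0 .. 2**M bits and a clear operation."""
    # Assumed reading of the width condition: any shift of 2**(M+2) bits or
    # more must empty the significand, so clearing it is equivalent.
    assert 2 ** (M + 2) > width
    sig &= (1 << width) - 1
    for i in range(J, -1, -1):          # most significant bit first
        if not (scr >> i) & 1:
            continue                    # bit clear: nothing happens this step
        if i > M + 1:
            sig = 0                     # underflow: clear every bit
        elif i == M + 1:
            sig >>= 1 << M              # first shift by 2**M bits
            sig >>= 1 << M              # second shift by 2**M bits
        else:
            sig >>= 1 << i              # single shift by 2**i bits
    return sig


print(hex(align_general(0b00001011, 0xAF05E3)))    # 0x15e0
print(hex(align_general(11, 0xAF05E3, J=7, M=4)))  # 0x15e0
```

With the defaults J=7 and M=3 the loop performs the same bit tests as the exemplary embodiment. A larger M trades a wider shifter for fewer cleared cases while producing the same result.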

[0045] While certain embodiments of the invention have been described and
illustrated above, the invention is not limited to these specific embodiments as
numerous modifications, changes and substitutions of equivalent elements can be made
without departing from the spirit and scope of the invention. Accordingly, the scope of
the present invention is not to be considered as limited by the specifics of the particular
structures which have been described and illustrated, but is only limited by the scope of
the appended claims.

CLAIMS

What is claimed as new and desired to be protected by Letters Patent of the
United States is:

1. A processing element having support for alignment of significands,
comprising:

a first register block, said first register block including at least one first
register for holding a first exponent and a first significand of a first floating point
number;

a second register block, said second register block including at least one
second register for holding a second exponent and a second significand of a second
floating point number and a second logic, said second logic capable of right shifting the
significand of the second floating point number and said second logic also being capable
of setting to zero each bit in a portion of said second significand;

a shift control register;

an arithmetic logic unit coupled to said first register block, said second
register block, and said shift control register, said arithmetic logic unit storing in the
shift control register a value equal to the difference between said first exponent and said
second exponent, said arithmetic logic unit causing the second logic to right shift the
significand or set to zero each bit in the portion of the significand, based upon the
contents of said shift control register.

2. The processing element of claim 1, wherein the second logic is capable of 
right shifting the second significand by 2 N bits, wherein N is an integer which ranges 
from zero to M, where M is a positive integer of at least 3. 

3. The processing dement of claim 2, wherein if bit J of said shift control 
register is equal to ofie, and if J is greater than Zvi+1, the portion cotrtspOiicU to the 2 J 
most significant bir 5 or r^.a Krcona sigaiiicsiic, =na 5 ni4 aritfameric logic unit cii^-es is 
second logic tc set zo =cro bit ju ^ portion, orlf&c second cs&aiikuaa is i^s 
than 2 J bits, rhe arithmetic logic unit causes ihe second logic to sec to zero each hie of 
said second significand. 

4. The processing element of claim 2, wherein if bit J of said shift control 
register is equal to one, and if J is equal to M+1, the arithmetic logic unit causes the 
second logic to twice right shift said second significand by 2^M bits. 

5. The processing element of claim 2, wherein if bit J of said shift control 
register is equal to one, and if J is less than or equal to M, the arithmetic logic unit 
causes the second logic to right shift said second significand by 2^J bits. 

6. The processing element of claim 2, wherein if bit J of said shift control 
register is equal to one, 

if J is greater than M+1, then the portion corresponds to the 2^J most significant 
bits of said second significand, and said arithmetic logic unit causes the second logic to 
set to zero each bit in said portion, or if the second significand is less than 2^J bits, the 
arithmetic logic unit causes the second logic to set to zero each bit of said second 
significand; or 

if J is equal to M+1, the arithmetic logic unit causes the second logic to twice 
right shift said second significand by 2^M bits; or 

if J is less than or equal to M, the arithmetic logic unit causes the second logic to 
right shift said second significand by 2^J bits. 
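
As an illustration only, the three cases recited for a set bit J can be modelled in software. The following is a minimal sketch, not the claimed circuit: the function name `apply_shift_bit` and the 32-bit default significand width are assumptions introduced here, and M defaults to 3 as in claim 7.

```python
def apply_shift_bit(sig: int, j: int, m: int = 3, width: int = 32) -> int:
    """Model the action selected when bit j of the shift control register is one.

    Hypothetical helper: `width` (significand size in bits) is an assumption.
    """
    sig &= (1 << width) - 1            # keep only the modelled significand bits
    if j <= m:
        return sig >> (1 << j)         # one right shift by 2^J bits
    if j == m + 1:
        step = 1 << m
        return (sig >> step) >> step   # two right shifts of 2^M bits each
    top = 1 << j                       # J > M+1: clear the 2^J most significant bits
    if width <= top:
        return 0                       # significand no wider than 2^J bits: clear it all
    return sig & ((1 << (width - top)) - 1)
```

With these defaults, any J of 5 or more clears the whole significand, which matches a right shift by 32 or more bits on a 32-bit value.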

7. The processing element of claim 6, wherein M is equal to 3. 

8. The processing element of claim 7, wherein J is equal to 0. 

9. The processing element of claim 7, wherein J is equal to 1. 

10. The processing element of claim 7, wherein J is equal to 2. 

11. The processing element of claim 7, wherein J is equal to 3. 

12. The processing element of claim 7, wherein J is equal to 4. 

13. The processing element of claim 7, wherein J is equal to 5. 

14. The processing element of claim 7, wherein J is equal to 6. 

15. The processing element of claim 7, wherein J is equal to 7. 

16. The processing element of claim 1, wherein if the value is negative, the 
arithmetic logic unit causes the content of said first register block to be exchanged with 
the content of said second register block, and the arithmetic logic unit negatives the 
value before storing the value in the shift control register. 
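
The exchange-and-negate step can be sketched as follows. This is a hedged illustration under stated assumptions: each register block is modelled as a hypothetical `(exponent, significand)` tuple, and `order_operands` is a name introduced here, not one used in the specification.

```python
def order_operands(first, second):
    """Return (first, second, value) with a non-negative shift value.

    Each operand is an assumed (exponent, significand) tuple standing in
    for the content of a register block.
    """
    value = first[0] - second[0]       # difference between the two exponents
    if value < 0:
        first, second = second, first  # exchange the register block contents
        value = -value                 # negate before storing in the shift control register
    return first, second, value
```

After this step the operand with the smaller exponent is always the second one, so only the second register block needs right-shift logic.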



17. A massively parallel processing system, comprising: 

an array of processing elements, each processing element of the array 
being coupled to said main memory and other processing elements of said array, 
wherein each of said processing elements comprises: 

a first register block, said first register block including at least one first 
register for holding a first exponent and a first significand of a first floating point 
number; 

a second register block, said second register block including at least one 
second register for holding a second exponent and a second significand of a 
second floating point number and a second logic, said second logic capable of 
right shifting the significand of the second floating point number and said 
second logic also being capable of setting to zero each bit in a portion of said 
second significand; 

a shift control register; 

an arithmetic logic unit coupled to said first register block, said second 
register block, and said shift control register, said arithmetic logic unit storing in 
the shift control register a value equal to the difference between said first 
exponent and said second exponent, said arithmetic logic unit causing the second 
logic to right shift the significand or set to zero each bit in the portion of the 
significand, based upon the contents of said shift control register. 

18. The massively parallel processing system of claim 17, wherein the second 
logic is capable of right shifting the second significand by 2^N bits, wherein N is an 
integer which ranges from zero to M, where M is a positive integer of at least 3. 

19. The massively parallel processing system of claim 18, wherein if bit J of 
said shift control register is equal to one, and if J is greater than M+1, the portion 
corresponds to the 2^J most significant bits of said second significand, and said arithmetic 
logic unit causes the second logic to set to zero each bit in said portion, or if the second 
significand is less than 2^J bits, the arithmetic logic unit causes the second logic to set to 
zero each bit of said second significand. 

20. The massively parallel processing system of claim 18, wherein if bit J of 
said shift control register is equal to one, and if J is equal to M+1, the arithmetic logic 
unit causes the second logic to twice right shift said second significand by 2^M bits. 

21. The massively parallel processing system of claim 18, wherein if bit J of 
said shift control register is equal to one, and if J is less than or equal to M, the 
arithmetic logic unit causes the second logic to right shift said second significand by 2^J 
bits. 




22. The massively parallel processing system of claim 18, wherein if bit J of 
said shift control register is equal to one, 

if J is greater than M+1, then the portion corresponds to the 2^J most significant 
bits of said second significand, and said arithmetic logic unit causes the second logic to 
set to zero each bit in said portion, or if the second significand is less than 2^J bits, the 
arithmetic logic unit causes the second logic to set to zero each bit of said second 
significand; or 

if J is equal to M+1, the arithmetic logic unit causes the second logic to twice 
right shift said second significand by 2^M bits; or 

if J is less than or equal to M, the arithmetic logic unit causes the second logic to 
right shift said second significand by 2^J bits. 

23. The massively parallel processing system of claim 18, wherein M is equal 
to 3. 

24. The massively parallel processing system of claim 23, wherein J equals 0. 

25. The massively parallel processing system of claim 23, wherein J equals 1. 

26. The massively parallel processing system of claim 23, wherein J equals 2. 

27. The massively parallel processing system of claim 23, wherein J equals 3. 

28. The massively parallel processing system of claim 23, wherein J equals 4. 

29. The massively parallel processing system of claim 23, wherein J equals 5. 

30. The massively parallel processing system of claim 23, wherein J equals 6. 

31. The massively parallel processing system of claim 23, wherein J equals 7. 

32. The massively parallel processing system of claim 17, wherein if the value 
is negative, the arithmetic logic unit causes the content of said first register block to be 
exchanged with the content of said second register block, and the arithmetic logic unit 
negatives the value before storing the value in the shift control register. 

33. In a processing element having a first register block including at least one 
first register for holding a first exponent and a first significand of a first floating point 
number and a second register block including at least one second register for holding a 
second exponent and a second significand of a second floating point number, the 
processing element having a second logic for right shifting the second significand by 2^N 
bits, wherein N is an integer ranging from zero to M, wherein M is an integer of at least 
3, a method for aligning the second significand, said method comprising the steps of: 

(a) storing in a storage control register a value, said value being equal to the 
second exponent register subtracted from the first exponent register; 

(b) for an integer J ranging from 0 to one less than the width of said shift 
control register, 

(1) if J is greater than M+1, setting each bit in the 2^J most significant 
bits of said second significand to zero, or setting each bit in the second 
significand to zero if said second significand is less than 2^J bits; 

(2) if J is equal to M+1, twice right shifting said second significand by 
2^M bits; or, 

(3) if J is equal to or less than M, right shifting said second significand 
by 2^J bits. 
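
The method steps can be sketched end to end as a loop over the bits of the control register. This is an illustrative model only. It assumes a 32-bit significand, an 8-bit control register, a non-negative exponent difference, and that a step is applied only when bit J of the stored value is one (a condition borrowed from the apparatus claims); `align_significand` and its parameters are hypothetical names.

```python
def align_significand(exp1: int, exp2: int, sig2: int,
                      m: int = 3, width: int = 32, ctrl_width: int = 8) -> int:
    """Align sig2 to exp1 using only shifts of 2^0 .. 2^M bits and clearing."""
    value = exp1 - exp2                     # step (a): exponent difference
    if not 0 <= value < (1 << ctrl_width):
        raise ValueError("difference must fit the assumed control register")
    sig2 &= (1 << width) - 1
    for j in range(ctrl_width):             # step (b): visit each bit J
        if not (value >> j) & 1:
            continue                        # assumption: only set bits act
        if j > m + 1:                       # (1) clear the 2^J top bits
            top = 1 << j
            sig2 = 0 if width <= top else sig2 & ((1 << (width - top)) - 1)
        elif j == m + 1:                    # (2) shift by 2^M bits, twice
            sig2 = (sig2 >> (1 << m)) >> (1 << m)
        else:                               # (3) shift by 2^J bits
            sig2 >>= 1 << j
    return sig2
```

Under these assumptions the result equals a plain right shift of the 32-bit significand by the exponent difference, for example `align_significand(12, 0, 0xDEADBEEF) == 0xDEADBEEF >> 12`.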

34. The method of claim 33, further comprising the step of: 

before step (a), if the value is a negative number, exchanging the contents 
of said first register block with said second register block; and 

negativing the contents of the storage control register. 

35. The method of claim 33, wherein M is equal to 3. 

36. The method of claim 35, wherein J is equal to 7. 

ABSTRACT 

[0046] The processing elements of a single instruction multiple data (SIMD) 
massively parallel processor (MPP) are provided with two register blocks. One register 
block includes logic for performing limited left shifting, while the other register block 
includes logic for performing limited right shifting. A method is disclosed for using the 
register blocks with their associated logic to perform floating point significand 
alignment and normalization. The limited shifting logic occupies less die space than a 